Abstract

Background: Protein structure alignment is often modeled as the largest common point set (LCP) problem based on
the Root Mean Square Deviation (RMSD), a measure commonly used to evaluate structural similarity. In this problem,
each residue is represented by the coordinates of its Cα atom, and a structure is modeled as a sequence of 3D points.
Given two such sequences, one is to find two equal-sized subsequences of maximum length, and a bijection
between the points of the subsequences which gives an RMSD within a given threshold. The problem is considered
to be difficult in terms of time complexity, but the reasons for its difficulty are not well understood. Improving this time
complexity is considered important in protein structure prediction and structural comparison, where the task of
comparing large numbers of structures is commonly encountered.

Results: To study why the LCP problem is difficult, we define a natural variant of the problem, called the minimum 
aligned distance (MAD). In the MAD problem, the length of the subsequences to obtain is specified in the input; and 
instead of fulfilling a threshold, the RMSD between the points of the two subsequences is to be minimized. Our results 
show that the difficulty of the two problems does not lie solely in the combinatorial complexity of finding the optimal 
subsequences, or in the task of superimposing the structures. By placing a limit on the distance between consecutive 
points, and assuming that the points are specified as integral values, we show that both problems are equally difficult, 
in the sense that they are reducible to each other. In this case, both problems can be exactly solved in polynomial 
time, although the time complexity remains high. 

Conclusions: We showed insights and techniques which we hope will lead to practical algorithms for the LCP 
problem for protein structures. The study identified two important factors in the problem's complexity: (1) the lack of
a limit on the distance between consecutive points of a structure; (2) the arbitrariness of the precision allowed in
the input values. Both issues are of little practical concern for the purpose of protein structure alignment. When these 
factors are removed, the LCP problem is as hard as that of minimizing the RMSD (MAD problem), and can be solved 
exactly in polynomial time. 
Background

A common approach to understanding the properties of a
protein is to compare it to other proteins. Proteins that are
similar, in terms of either their amino acid sequences or
3-dimensional structures, often share similar functions, or
are related evolutionarily. The latter, structural comparison,
is particularly interesting since protein structures are
known to be more evolutionarily conserved than the
biological sequences which encode them. Furthermore,
proteins of similar structures may have similar functionality,
even when their sequences differ [1].

Structural comparison is typically a problem of aligning
two sets of 3-dimensional coordinates. (In most of the
known structural alignment problems, each point is the
3D coordinate of a Cα atom, one per residue. Hence,
a structure can be modeled for structural alignment
purposes as a sequence of 3D points.) The alignment usually
involves a rigid transformation to superimpose the two
sequences of points, and a mapping which specifies the
matched points. The parameters to optimize in the
alignment may differ in different situations, because it is not
easy to single out a set of parameters that best captures
the similarity between two given structures [2]. In many
situations, the alignment need not match every
point in the two sequences. At present, there is a consensus
among molecular biologists on the use of the following
two parameters [2-4]:

1. the number of residues (points) or percentage of total 
residues (points) matched in the alignment. 

2. the root mean square deviation (RMSD) of the 
matched residues (points). 

In general, the RMSD need not be minimized. It suf- 
fices that it is within a reasonable threshold. Hence, a 
good alignment is customarily taken to be one which max- 
imizes the number of residue matches, within a given 
RMSD threshold. Many structural alignment methods are 
based on this principle. The computational complexity of 
finding an optimal solution to the problem is not well 
understood. Shibuya et al. formulated a restricted version
of the problem, and showed it to be NP-hard
when the dimensionality is arbitrary. It is open whether
their problem is NP-hard in three dimensions [5]. Other
problems related to structural comparison based on the RMSD
have been found to be difficult. For example, the problem
of finding a substructure from multiple 3-dimensional
structures which minimizes the total RMSD is
NP-hard [6].

For the variants of the alignment problem that are not
based on the RMSD, we have the following results. When
the objective is to maximize the number of point matches
which are no more than a threshold distance apart, the
problem is solvable in $O(n^{32.5})$ time, where $n$ is the
number of points [7]. The contact map overlap problem, where
a graph is created out of each structure and the problem
is one of comparing the two graphs, is NP-hard
[8], and remains NP-hard even when we require points
that are matchable to be within a threshold distance [9].
These results, together with an early result which shows
a related problem called threading to be NP-hard [10],
have traditionally led molecular biologists to believe that
the structural alignment problem is difficult in general
(e.g. [11-13]), even though a PTAS exists for the problem
under a broad class of distance measures [14]. Heuristic
algorithms have also been proposed for many variants of
the structural alignment problem [15-23]. While these
methods perform reasonably well in general, they provide no
guarantee on the quality of their results.

As noted by Shibuya et al., relatively few theoretical
results have been obtained on problems defined over the
RMSD, and the general problem of structural alignment
under the RMSD remains open [5]. At present, whether
the problem is intractable or not is not only of theoretical
interest but also of practical concern, due to advances in
protein structure prediction which require the comparison
of large numbers of structures. In this paper we show
mathematical insights and techniques which we hope will
lead to practical algorithms for the problem.

We first show that the difficulty of the problem does not
lie solely in the individual components of its requirements.
More precisely,

- if either a mapping that contains the optimal 
mapping is known (Theorem 3), or 

- if the optimal superposition is known (Lemma 1), 

then the problem can be solved in polynomial time. 

Our study shows that the difficulty of the LCP problem
is also very much due to two factors: (1) the problem
allows the input coordinates to be of any arbitrary precision,
and (2) it assumes no limit on the distance between
two consecutive Cα atoms.

We consider the case where the input coordinates are
integral, and the distance between two consecutive points
is restricted. The first requirement is practical since
coordinates of protein structures are typically specified to a
fixed precision (e.g. three decimal places [24]),
and can be trivially scaled up to integral values.
Similar assumptions are made in Euclidean problems such
as the Euclidean TSP [25]. The second requirement likewise
does not add any restriction to the problem of protein
structure alignment, since there is a natural upper bound
(~3.8 Å) to the distance between two consecutive Cα atoms. In this
case, the following results hold.

- Given a polynomial time algorithm for finding a
largest alignment of RMSD below a threshold d, one
can efficiently compute an alignment of a given size $\ell$
which minimizes the RMSD (Theorem 7). (Since the
other direction is easy, this shows that the two
problems are of similar difficulty.)

- The structural alignment problem under the RMSD 
is solvable exactly in polynomial time (Theorem 10). 

Preliminary 

Notations and definitions 

A protein structure for alignment purposes is modeled
as a finite, ordered sequence of three dimensional (3D)
points. Hence, a structure of $n$ residues is written as
$P = (p_1, p_2, \ldots, p_n)$, where each point $p_i \in \mathbb{R}^3$. In the 'Results
assuming integral coordinates and restricting distance
between points' section, we will further assume each $p_i$ to
be integral. We write $P' \subseteq P$ iff $P'$ is a subsequence of $P$,
and write $f : P \to Q$ iff $f$ is a mapping which maps points
in the sequence $P$ to points in the sequence $Q$.
Problem statements

We now state our problems. The main problem we con- 
sider is the largest common point set (LCP) problem 
under the RMSD, a well-known problem in protein struc- 
ture alignment. In the LCP, the objective is to find a 
mapping of the largest cardinality where the RMSD of 
the matched points is no more than a given threshold 
(Table 1). 

We do not require the optimal superposition of $P'$ and
$Q'$ in the output, since that can be computed from $P'$, $Q'$,
and $f$ in linear time [26]. We refer to $f$ as an alignment. An
alignment can be sequential or non-sequential: an alignment
is sequential iff for any two points $p_{i_1}, p_{i_2} \in P'$,
where the corresponding $f(p_{i_1}) = q_{j_1}$ and $f(p_{i_2}) = q_{j_2}$,
we have $i_1 < i_2$ iff $j_1 < j_2$. Otherwise the alignment is
non-sequential. The LCP problem which requires alignments
to be sequential is said to be sequential, otherwise
it is non-sequential. We mainly discuss sequential alignment
in this paper. The techniques developed can be easily
adapted to the non-sequential case. Given two equal-sized
sequences $P' = (p_1, \ldots, p_n)$ and $Q'$, together with a
bijection $f$ between $P'$ and $Q'$, the root mean square deviation
(RMSD) is defined as



$$\mathrm{RMSD}(P', Q') = \min_{t} \sqrt{\frac{\sum_{i : 1 \le i \le n} \| t(f(p_i)) - p_i \|^2}{n}} \tag{1}$$



where t is a rigid transformation. The RMSD, with its cor- 
responding transformation t, can be computed in linear 
time [26]. 
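
The computation of Equation 1 for a fixed bijection can be sketched in a few lines. The sketch below is an illustration only: it assumes NumPy and uses the widely known SVD-based superposition (often attributed to Kabsch), which may differ in detail from the method of [26]. The function name `rmsd_after_superposition` is hypothetical, and row $i$ of `Q` is assumed to be the point matched to row $i$ of `P`.

```python
import numpy as np

def rmsd_after_superposition(P, Q):
    """RMSD of matched rows of P and Q after the best rigid motion of Q onto P."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Move both centroids to the origin; this fixes the optimal translation.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # 3x3 matrix of coordinate dot products, then its singular value decomposition.
    H = Qc.T @ Pc
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (no reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ Qc.T).T - Pc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Apart from the constant-size $3 \times 3$ decomposition, every step is a single pass over the matched points, which is consistent with the linear running time stated above.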

A natural variant of the LCP problem is to minimize the
RMSD instead of maximizing the size of the mapping, as
follows. Given an integer $\ell$, find subsequences of size $\ell$
of the input, such that the RMSD between the points of
the subsequences is minimized. We call this problem the
minimum aligned distance (MAD) problem (Table 2).

Clearly, if the MAD problem is solvable in polynomial 
time, then the LCP problem is solvable in polynomial 
time. However, the other direction is unclear. Theorem 7 
will show that for P and Q of integral coordinates, if the 
LCP problem is solvable in polynomial time, then the 
MAD problem is solvable in polynomial time. 

Table 1 Largest Common Point (LCP) set problem
definition under RMSD

LCP problem by the RMSD measure

Input:   sequences $P = (p_1, \ldots, p_n)$, $Q = (q_1, \ldots, q_m)$ and distance
         threshold $\theta$. Without loss of generality assume $m \ge n$.

Output:  (i) subsequences $P' \subseteq P$, $Q' \subseteq Q$, $|P'| = |Q'|$, and
         (ii) bijection $f : P' \to Q'$, fulfilling the following conditions:
             (A) $\mathrm{RMSD}(P', f(P')) \le \theta$,
             (B) the score $\ell = |P'|$ is maximized.



Table 2 Minimum Aligned Distance (MAD) problem
definition under RMSD

MAD problem by the RMSD measure

Input:   sequences $P = (p_1, \ldots, p_n)$, $Q = (q_1, \ldots, q_m)$ and an integer $\ell$.
         Without loss of generality assume $m \ge n$.

Output:  (i) subsequences $P' \subseteq P$, $Q' \subseteq Q$, $|P'| = |Q'|$, and
         (ii) bijection $f : P' \to Q'$, fulfilling the following conditions:
             (A) $|P'| = \ell$,
             (B) $d = \mathrm{RMSD}(P', f(P'))$ is minimized.

We let $P$, $Q$, $f$, $\ell$ and $d_{opt}$ denote an optimal $P'$, $Q'$, $f$, $\ell$
and $d$, respectively. The optimal rigid transformation for
superimposing $P$ and $Q$ is denoted $T$, and can be computed
from $P$, $Q$ and $f$. The symbol $c^P_{\max}$ denotes the
largest value in the coordinates of $P$, $c^Q_{\max}$ the largest value
in the coordinates of $Q$, and $c_{\max} = \max\{c^P_{\max}, c^Q_{\max}\}$.
We know that $c_{\max} = O(n)$ for protein structures.



Results for general LCP and MAD 

Complexity of the LCP and MAD when the optimal 
superposition is known 

Since two point sequences with a known mapping can be 
superimposed optimally under the RMSD in linear time 
[26], it is natural to ask if the difficulty in LCP or MAD lies 
solely in the combinatorial complexity of finding the opti- 
mal subsets, i.e. P and Q. Our results show the contrary: if 
the optimal superposition T is known, both problems can 
be solved in polynomial time. 

We first consider the sequential case. Let $d_{p,q} = \|t(p) - q\|^2$
and let $M[i,j;k]$ denote the minimum squared sum
cost of $k$ pair matches for the point sets $(p_1, p_2, \ldots, p_i)$ and
$(q_1, q_2, \ldots, q_j)$. If $1 \le k \le i$, $2 \le i \le n$ and $2 \le j \le m$, we
have a recurrence relation of

$$M[i,j;k] = \min \begin{cases} M[i-1,j-1;k-1] + d_{p_i,q_j}, \\ M[i,j-1;k], \\ M[i-1,j;k]. \end{cases} \tag{2}$$



The base case of the recursion is obvious. Dynamic
programming can be employed to fill up the respective
$M[i,j;k]$ values. After all the values are filled, one can find
the maximum $k$ such that the squared sum is no more
than $k\theta^2$ for the LCP problem. The MAD problem can be
solved similarly.

The non-sequential case can be similarly solved using 
the maximum-flow minimum-cost problem [27]. The fol- 
lowing lemma states these results. 

Lemma 1. If an optimal transformation $T$ is known,
both the LCP problem and the MAD problem can be solved
in $O(mn\ell)$ time.
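
For the sequential case, the dynamic program behind Lemma 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than a tuned implementation: the points of `Q` are assumed to be already superimposed onto `P` by the known transformation, the table is filled for every $k$ up to $\min(n, m)$ instead of stopping at $\ell$, and the function names are hypothetical.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def min_cost_table(P, Q):
    """M[i][j][k]: least summed squared distance of k sequential matches
    between the first i points of P and the first j points of Q."""
    n, m = len(P), len(Q)
    kmax = min(n, m)
    M = [[[math.inf] * (kmax + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            M[i][j][0] = 0.0  # base case: zero matches cost nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sq_dist(P[i - 1], Q[j - 1])
            for k in range(1, min(i, j) + 1):
                M[i][j][k] = min(M[i - 1][j - 1][k - 1] + d,  # match p_i with q_j
                                 M[i][j - 1][k],              # skip q_j
                                 M[i - 1][j][k])              # skip p_i
    return M

def mad_fixed_superposition(P, Q, ell):
    """Smallest RMSD over sequential alignments of exactly ell pairs."""
    return math.sqrt(min_cost_table(P, Q)[len(P)][len(Q)][ell] / ell)

def lcp_fixed_superposition(P, Q, theta):
    """Largest k whose best squared sum is at most k * theta^2."""
    best = min_cost_table(P, Q)[len(P)][len(Q)]
    return max(k for k in range(len(best)) if best[k] <= k * theta ** 2)
```

Each table entry is computed in constant time, so the sketch runs in $O(mn \min(m, n))$ time; capping $k$ at $\ell$ would give the $O(mn\ell)$ bound of the lemma.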
Complexity of the LCP and MAD when the matching
between the point sets is known

We next ask if the difficulty in the LCP and MAD could
be due to the task of superimposing $P$ and $Q$ in an optimal
manner to identify the subsets $P$ and $Q$. To examine
this possibility, we remove the combinatorial task of examining
each of the possible mappings between the points,
by assuming a bijection $F$ which contains the optimal
mapping. (Note that this results in the problem known
as model superposition in structural biology.) Again, our
results show the resultant LCP and MAD problems to be
solvable exactly in polynomial time.

Assume that $F$ is a bijection that maps points in $P$ to
points in $Q$. Let $P' = (p'_1, \ldots, p'_l)$ be the domain of $F$ and
$Q' = (q'_1, \ldots, q'_l)$ be the range of $F$. Without loss of generality
assume that $F$ is sequential, and hence $F(p'_i) = q'_i$.

One can exhaustively evaluate all the subsequences of
$P'$ for one with the least RMSD. However, since the number
of such subsequences is exponential in $l$, this does not
immediately give us a polynomial time solution.

If a rigid transformation $T$ for $Q'$ is given, and all
the pairs $(p'_i, q'_i)$, $1 \le i \le l$, are sorted according to
the value $\|T(q'_i) - p'_i\|$, the MAD problem is then to
choose the first $\ell$ pairs from $(p'_i, q'_i)$, $1 \le i \le l$, and
the LCP problem is to choose the first $\ell$ pairs such
that $\mathrm{RMSD}((p'_1, \ldots, p'_\ell), (q'_1, \ldots, q'_\ell)) \le \theta$ and either $\ell = l$ or
$\mathrm{RMSD}((p'_1, \ldots, p'_\ell, p'_{\ell+1}), (q'_1, \ldots, q'_\ell, q'_{\ell+1})) > \theta$. This gives
us an incentive to obtain a total ordering of $\|T(q'_i) - p'_i\|$,
which will allow us to solve the MAD problem by selecting
the first $\ell$ pairs in the ordering. The set of transformations
which produce the same total order according to $\|T(q'_i) - p'_i\|$
yield the same result for the MAD problem, and therefore
these transformations are equivalent. This enables us to
design a discrete version of the problem.

For clarity, we first present an algorithm with only trans- 
lation. 

With translations only 

Consider two pairs $(p'_i, q'_i)$ and $(p'_j, q'_j)$. The transformations
$T$ that separate the two types of transformations, those $T_1$ with
$\|T_1(q'_i) - p'_i\| > \|T_1(q'_j) - p'_j\|$ and those $T_2$ with
$\|T_2(q'_i) - p'_i\| < \|T_2(q'_j) - p'_j\|$, are the transformations where

$$\|T(q'_i) - p'_i\|^2 - \|T(q'_j) - p'_j\|^2 = 0. \tag{3}$$

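Let $\cdot$ denote the dot product. If the transformation is a
translation $t$, we have

$$\begin{aligned}
&\|T(q'_i) - p'_i\|^2 - \|T(q'_j) - p'_j\|^2 \\
&\quad= \|q'_i - p'_i - t\|^2 - \|q'_j - p'_j - t\|^2 \\
&\quad= \sum_{v = x,y,z} (v_{q'_i} - v_{p'_i} - v_t)^2 - \sum_{v = x,y,z} (v_{q'_j} - v_{p'_j} - v_t)^2 \\
&\quad= \|q'_i - p'_i\|^2 - \|q'_j - p'_j\|^2 - 2t \cdot \big((q'_i - p'_i) - (q'_j - p'_j)\big) = 0.
\end{aligned} \tag{4}$$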



Consider the space of all translation vectors in $\mathbb{R}^3$, and
consider each vector as a point in this space (not the space
that $P$ and $Q$ are in). The values that the variable $t$ in
Equation 4 may take form a plane in this translation space.
The plane partitions the translation space into two types
of translations, $T_1$ and $T_2$ say, where $\|T_1(q'_i) - p'_i\| > \|T_1(q'_j) - p'_j\|$
and $\|T_2(q'_i) - p'_i\| < \|T_2(q'_j) - p'_j\|$. Since
there are $l$ pairs, there are $O(l)$ planes, which partition the
space into $O(l^3)$ cells.

The translations in each cell result in the same ordering
of the pairs with respect to $\|T(q'_i) - p'_i\|$. For each cell, this
total order can be obtained in $O(1)$ time from any given
total order of its neighbor cells, since the change is $O(1)$.
Therefore, the MAD solution can be obtained in amortized
time $O(l)$ for each cell, and the LCP solutions thus
can be obtained in time $O(l^2)$. Hence the total runtime
is $O(l^4)$ for the MAD problem, and $O(l^5)$ for the LCP
problem of translations, with the given mapping $F$.
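
The role of the separating planes can be checked numerically. The sketch below is only an illustration with hypothetical function names; it assumes, as in Equation 4, that a translation acts as $T(q) = q - t$. It computes the plane for two pairs, reports on which side a given translation lies, and ranks pairs for one fixed translation, which is the step repeated once per cell.

```python
def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def separating_plane(p_i, q_i, p_j, q_j):
    """Return (normal, offset) of the plane offset - 2 t.normal = 0 from Equation 4."""
    a = sub(q_i, p_i)
    b = sub(q_j, p_j)
    return sub(a, b), dot(a, a) - dot(b, b)

def side_of_plane(t, p_i, q_i, p_j, q_j):
    """+1 if pair i is farther apart than pair j under translation t, -1 if closer, 0 on the plane."""
    normal, offset = separating_plane(p_i, q_i, p_j, q_j)
    value = offset - 2 * dot(t, normal)
    return (value > 0) - (value < 0)

def best_pairs_for_translation(pairs, t, ell):
    """Rank the (p, q) pairs by ||q - t - p||^2 and keep the ell closest."""
    def cost(pair):
        p, q = pair
        r = sub(sub(q, t), p)
        return dot(r, r)
    return sorted(pairs, key=cost)[:ell]
```

All translations on the same side of every plane give the same output of `side_of_plane` for every two pairs, and hence the same ranking, which is why one representative translation per cell suffices.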

With rigid transformations 

The rigid transformations which separate the two relations
$\|T_1(q'_i) - p'_i\| > \|T_1(q'_j) - p'_j\|$ and $\|T_2(q'_i) - p'_i\| < \|T_2(q'_j) - p'_j\|$
are as in Equation 3.

Suppose the rigid transformation $T$ is composed of a
rotation $R$ and a translation $t$. Then

$$\begin{aligned}
&\|T(q'_i) - p'_i\|^2 - \|T(q'_j) - p'_j\|^2 \\
&\quad= \|R(q'_i) - p'_i - t\|^2 - \|R(q'_j) - p'_j - t\|^2 \\
&\quad= \|R(q'_i) - p'_i\|^2 - \|R(q'_j) - p'_j\|^2 - 2t \cdot \big[(R(q'_i) - p'_i) - (R(q'_j) - p'_j)\big].
\end{aligned} \tag{5}$$

A rotation matrix contains three variables, which are
specified using three angles, say $\alpha_1, \alpha_2, \alpha_3$, each from $-\pi$
to $\pi$. Let $r_i = \cos \alpha_i$ and $s_i = \sin \alpha_i$; then Equation 5 can
be considered as a polynomial of degree six in nine
variables. The nine variables are $r_i$, $s_i$ and the three variables for
translation. In total, there are $O(l)$ such polynomials.

We know the following theorem from the literature. 

Theorem 2. [7,28] Given a set of $k$ polynomials, $V = \{f_1, \ldots, f_k\}$,
where each polynomial has a maximum degree
of $s$, contains at most $r$ variables, and in addition all the
coefficients are rational, then all the sign conditions can
be determined by $O(k(k/r)^r s^{O(r)})$ arithmetic operations. A
sign condition is the vector of signs
$(\mathrm{sign}(f_1(u)), \ldots, \mathrm{sign}(f_k(u)))$ for some point $u \in \mathbb{R}^r$. Two points
$u, u' \in \mathbb{R}^r$ are equivalent if their sign condition vectors are
the same.

Each sign vector represents the transformations of the
cell it belongs to, and it determines a total order of the
pairs. As in the case of translations, with Theorem 2
we have
Theorem 3. Given a bijection $F : P' \to Q'$, where $|P'| = l$,
the MAD problem can be solved in $O(l^{10})$ time and
the LCP problem can be solved in $O(l^{11})$ time.

Results assuming integral coordinates and 
restricting distance between points 

One possible contributing factor to the difficulty of 
the LCP problem could be its flexibility in allowing 
input coordinates of any arbitrary precision. This is 
because intuitively, this arbitrariness in the precision 
introduces the burden of examining the solution space 
in an unbounded manner. However, such an exhaustive 
search is not necessary for the purpose of protein struc- 
ture comparison, since coordinates of protein structures 
are specified only to three decimal places in the commonly 
used PDB format. 

In this section, we restrict the precision in which the 
input coordinates may be specified. Without loss of gen- 
erality, we assume that the input coordinates are given in 
integers, since numbers of any fixed precision can be triv- 
ially scaled up to integral values. This assumption is used 
to obtain Lemma 5 and Theorem 7. 
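
The scaling step is simple enough to state in code. The sketch below is an illustration with a hypothetical function name; it assumes coordinates given to three decimal places, as in the PDB format mentioned above, and rounds after scaling so that floating-point noise does not leak into the integers.

```python
def to_integer_coordinates(points, decimals=3):
    """Scale fixed-precision coordinates by 10**decimals and round to integers."""
    scale = 10 ** decimals
    return [tuple(round(c * scale) for c in p) for p in points]
```

Scaling every coordinate of both structures by the same factor multiplies all distances, and therefore the RMSD and any threshold on it, by that factor, so solutions of the scaled instance map back directly to the original one.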

We also place an upper bound on the distance between 
consecutive points according to the structure of proteins. 
As a result, c max is bounded by n, as follows. 

Points drawn from protein structures have upper 
bounds on their diameters because they are connected, 
and many are globular. 

- For a connected structure, the points are at most
$O(n)$ distance apart. That is, $c_{\max}$ is of $O(n)$.

- For a globular structure, the points are at most
$O(n^{1/3})$ distance apart [14]. That is, $c_{\max}$ is of
$O(n^{1/3})$.

Given a point $p$, let the $x$ coordinate of $p$ be denoted $x_p$.
Similarly, we can define $y_p$ and $z_p$. Without loss of generality,
we assume that the first point of a protein structure
is at the origin. The largest coordinate of a protein is
bounded by $O(n)$, and the largest coordinate of a globular
protein structure is bounded by $O(n^{1/3})$; that is

$$\max_{p \in P,\ v = x,y,z} |v_p| = O(n), \text{ if } P \text{ is a protein structure} \tag{6}$$

$$\max_{p \in P,\ v = x,y,z} |v_p| = O(n^{1/3}), \text{ if } P \text{ is a globular protein structure} \tag{7}$$

Our results show that, under these two conditions, 

1. the LCP problem is of similar difficulty as the MAD 
problem, and 

2. both problems can be solved exactly in polynomial 
time. 



Properties of protein structures 
Upper and lower bounds of RMSD 

We first establish some bounds to the RMSD. The mini- 
mum RMSD is zero if T brings Q to coincide exactly with 
P. This case is referred to as the exact matching, which 
can be easily solved by the method in [29]. However, if we 
assume the RMSD to be non-zero, then a lower bound and 
an upper bound for it can be computed. 

Let $\pi$ be a permutation of $\{1, \ldots, \ell\}$. For the sequence
$X = (x_1, \ldots, x_n)$, let $d^X_{i,j}$, $1 \le i, j \le n$, denote the Euclidean
distance between $x_i$ and $x_j$. The following results, which
are proven in the Appendix, can be obtained.

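Lemma 4.
$$\sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left| d^P_{\pi(i),\pi(i+\lfloor \ell/2 \rfloor)} - d^Q_{\pi(i),\pi(i+\lfloor \ell/2 \rfloor)} \right|^2} \le \mathrm{RMSD}(P, Q).$$

Lemma 5 (Lower bound). If $\mathrm{RMSD}(P, Q) \ne 0$, then
$$\mathrm{RMSD}(P, Q) \ge \frac{\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}}{\sqrt{2\ell}}.$$

Lemma 6 (Upper bound). $\mathrm{RMSD}(P, Q) \le 4\sqrt{3}\, c_{\max}$.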

Using an algorithm for the LCP to solve the MAD problem 

Suppose there is a polynomial time algorithm for solving
the LCP problem. To use it to solve the MAD problem,
we assume that $d_{opt} \in [l, u]$ for some real $l$ and $u$,
$l < u$. We use a binary search strategy in the interval
$[l, u]$, as shown in Table 3, to search for the minimum

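Table 3 Employing an algorithm for the LCP problem to
solve the MAD problem

Input:   sequences $P = (p_1, \ldots, p_n)$, $Q = (q_1, \ldots, q_m)$ and an integer $\ell$.
         Without loss of generality assume $m \ge n$.

Output:  (i) subsequences $P' \subseteq P$, $Q' \subseteq Q$, $|P'| = |Q'|$, and
         (ii) mapping $f : P' \to Q'$, fulfilling the following conditions:
             (A) $|P'| = \ell$,
             (B) $d = \mathrm{RMSD}(P', f(P'))$ is minimized.

1. $l \leftarrow 0$, $u \leftarrow 4\sqrt{3}\, c_{\max}$ (the upper bound of Lemma 6)
2. $\theta \leftarrow \frac{1}{2}(l + u)$
3. Call LCP to solve the instance $(P, Q, \theta)$.
4. If the LCP solution has size no less than $\ell$
       $u \leftarrow \theta$
   else
       $l \leftarrow \theta$
5. If $u - l < \dfrac{\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}}{\sqrt{2\ell}}$
       Output the most recent LCP solution of size no less than $\ell$.
   Otherwise, repeat Steps 2-5.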
value such that the LCP solution size is $\ell$. However, the
search will not terminate if an arbitrary accuracy of the
$d_{opt}$ value is required. We prove below that the accuracy
of $d_{opt}$ can be defined by polynomially many bits. Given
two thresholds $t_1$ and $t_2$, assume that we obtain two
different LCP solutions, and the RMSD values of the two
solutions are $\theta_1$ and $\theta_2$, where $\theta_1 > \theta_2$. Similar to the
arguments in Lemma 5, the difference between $\theta_1$ and $\theta_2$ is
at least $\left(\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}\right)/\sqrt{2\ell}$. Therefore, if two consecutive
binary search operations have a difference of threshold
values below $\left(\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}\right)/\sqrt{2\ell}$, the search can be
terminated. The values of $l$ and $u$ are the same as in the
previous subsection. Hence,

Theorem 7. Solving the MAD problem is equivalent to
solving $O(\log \ell c_{\max})$ instances of the LCP problem.

Since the reduction from the LCP problem to the MAD 
problem is obvious, we conclude that the two problems 
are of similar difficulty. 
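
The binary search behind Theorem 7 can be sketched as follows. The sketch is illustrative only: `lcp_solver(P, Q, theta)` stands for a hypothetical oracle returning a pair `(size, alignment)` for the LCP instance with threshold `theta`, the initial interval uses $0$ and the upper bound $4\sqrt{3}\,c_{\max}$ of Lemma 6, and the stopping gap is the separation from Lemma 5, written in the algebraically equivalent form $1/\big((\sqrt{a} + \sqrt{a-1})\sqrt{2\ell}\big)$ with $a = 12c_{\max}^2$ to avoid floating-point cancellation.

```python
import math

def mad_via_lcp(P, Q, ell, c_max, lcp_solver):
    """Binary search on the RMSD threshold using an LCP oracle (hypothetical interface)."""
    a = 12 * c_max ** 2
    # Equals (sqrt(a) - sqrt(a - 1)) / sqrt(2 * ell), computed without cancellation.
    gap = 1.0 / ((math.sqrt(a) + math.sqrt(a - 1)) * math.sqrt(2 * ell))
    lo, hi = 0.0, 4.0 * math.sqrt(3.0) * c_max
    size, best = lcp_solver(P, Q, hi)
    if size < ell:
        return None, None  # no alignment of ell pairs exists at all
    while hi - lo >= gap:
        mid = (lo + hi) / 2.0
        size, alignment = lcp_solver(P, Q, mid)
        if size >= ell:
            hi, best = mid, alignment  # feasible: tighten the upper end
        else:
            lo = mid                   # infeasible: raise the lower end
    return best, hi
```

The interval is halved on every call, so the number of oracle calls is logarithmic in the ratio of the initial interval length to the gap, in line with the count in Theorem 7.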

Polynomial time algorithm 

We now show that under the two conditions, the LCP and
MAD problems can be solved in polynomial time.

As shown in the 'Complexity of the LCP and MAD
when the optimal superposition is known' section, when 
the optimal superposition is known, there are polynomial 
time algorithms for LCP and MAD. We consider an enu- 
meration of all the possible superpositions. Under the two 
conditions, we claim that there are at most polynomially 
many such superpositions. 

First, if we know $P$ and $Q$, then the optimal superposition
can be computed in the following two steps:

1. Translate P and Q such that their centroids are at the 
origin. 

2. Then, rotate Q to find the superposition with the 
minimum distance [26]. 

Denote the translations to obtain the optimal solution
for $P$ and $Q$ as $t_P$ and $t_Q$, respectively, and denote the
optimal rotation by $R$.

We now show that one needs to examine only polynomially
many translation and rotation combinations to
discover the values for $t_P$, $t_Q$, and $R$. These numbers can
be effectively bounded by $n$ when properties of protein
structures are taken into account. We first describe these
properties.



Number of translations 

The centroid of $P$ is $\frac{1}{\ell}\sum_{p \in P} p$. To bring $P$ to the origin, the
translation is $-\frac{1}{\ell}\sum_{p \in P} p$. Clearly, all the three coordinates
of $-\sum_{p \in P} p$ are integers, since all the coordinates of the
points $p \in P$ are integers. The value of the $x$-coordinate
of $-\frac{1}{\ell}\sum_{p \in P} p$ is bounded by

$$\left| \frac{\sum_{p \in P} x_p}{\ell} \right| \le \frac{\sum_{p \in P} |x_p|}{\ell} \le \frac{\ell\, c_{\max}}{\ell} = c_{\max}.$$

Similarly, all the three coordinates of the translation
$-\frac{1}{\ell}\sum_{p \in P} p$ are bounded within the interval $[-c_{\max}, c_{\max}]$.
To obtain an optimal MAD solution, each coordinate of the translation on $P$
must be of the form $I/\ell$, where $I$ is an integer. Since it is
possible to examine all the possible values for $I$, we have
the following result.

Lemma 8. $t_P, t_Q \in \{ I/\ell \mid -\ell c_{\max} \le I \le \ell c_{\max} \}^3$.

Number of rotations 

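With the centroids of $P$ and $Q$ translated to the origin, we
proceed to identify the rotation in our algorithm. Let $X_P$
denote the vector $(x_{p_1}, \ldots, x_{p_\ell})$ for structure $P$. Similarly we
define $Y_P$ and $Z_P$.
Let $P_t = (p_1 - t, \ldots, p_\ell - t)$ and $Q_t = (q_1 - t, \ldots, q_\ell - t)$,
$p_i \in P$, $q_i \in Q$.

Given $P$ and $Q$, to compute the rotation $R$, the first step
is to create the $3 \times 3$ matrix, which is (from [26])

$$M = \begin{pmatrix}
X_{P_{t_P}} \cdot X_{Q_{t_Q}} & X_{P_{t_P}} \cdot Y_{Q_{t_Q}} & X_{P_{t_P}} \cdot Z_{Q_{t_Q}} \\
Y_{P_{t_P}} \cdot X_{Q_{t_Q}} & Y_{P_{t_P}} \cdot Y_{Q_{t_Q}} & Y_{P_{t_P}} \cdot Z_{Q_{t_Q}} \\
Z_{P_{t_P}} \cdot X_{Q_{t_Q}} & Z_{P_{t_P}} \cdot Y_{Q_{t_Q}} & Z_{P_{t_P}} \cdot Z_{Q_{t_Q}}
\end{pmatrix} \tag{8}$$

Each such matrix is decomposed by the singular value
decomposition, and a rotation matrix is produced thereafter.

We know that the coordinate of each point in the protein
is within the interval $[-c_{\max}, c_{\max}]$. This implies that for
$U = X, Y, Z$ and $V = X, Y, Z$,

$$U_{P_{t_P}} \cdot V_{Q_{t_Q}} = \sum_{k=1}^{\ell} (u_{p_k} - u_{t_P})(v_{q_k} - v_{t_Q}) \le \sum_{k=1}^{\ell} (2c_{\max})^2 \le 4\ell c_{\max}^2,$$

where $u_p$ and $v_q$ denote the $U$-coordinate of $p$ and the $V$-coordinate of $q$.
Also, it is clear that $U_{P_{t_P}} \cdot V_{Q_{t_Q}}$ is of the form $I/\ell^2$,
where $I$ is an integer. The matrix in Equation 8 has nine
elements; we denote $e \in M$ if $e$ is one of these elements.
The following lemma follows.

Lemma 9. For each element $e \in M$ in Equation 8,
$e \in \{ I/\ell^2 \mid -4\ell^3 c_{\max}^2 \le I \le 4\ell^3 c_{\max}^2 \}$.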
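
To make the size of the search space concrete, the candidate sets of Lemmas 8 and 9 can be enumerated or counted directly. The sketch below is an illustration with hypothetical function names; it assumes integral inputs and reads the bound of Lemma 9 as $|I| \le 4\ell^3 c_{\max}^2$ for every matrix entry. Exact rational arithmetic from the standard `fractions` module keeps the candidates free of rounding error.

```python
from fractions import Fraction
from itertools import product

def translation_candidates(ell, c_max):
    """All vectors in {I/ell : -ell*c_max <= I <= ell*c_max}^3 (Lemma 8)."""
    axis = [Fraction(i, ell) for i in range(-ell * c_max, ell * c_max + 1)]
    return product(axis, repeat=3)

def count_translations(ell, c_max):
    """Number of candidate translations for one structure."""
    return (2 * ell * c_max + 1) ** 3

def count_matrices(ell, c_max):
    """Number of candidate 3x3 matrices, assuming |I| <= 4*ell^3*c_max^2 per entry."""
    per_entry = 2 * 4 * ell ** 3 * c_max ** 2 + 1
    return per_entry ** 9
```

Both counts are polynomial in $\ell$ and $c_{\max}$, of order $\ell^3 c_{\max}^3$ and $\ell^{27} c_{\max}^{18}$ respectively, although the degrees are far too high for direct enumeration in practice.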

Polynomial time algorithm

To compute the optimal MAD solution, we first enumerate
all the possible translations and rotations. A solution is
computed for each translation and rotation combination
according to Lemma 1. An optimal solution can be chosen
from these computed solutions (Table 4).

Table 4 A polynomial time algorithm for the MAD problem

Input:   sequences $P = (p_1, \ldots, p_n)$, $Q = (q_1, \ldots, q_m)$ and an integer $\ell$.
         Without loss of generality assume $m \ge n$.

Output:  (i) subsets $P' \subseteq P$, $Q' \subseteq Q$, $|P'| = |Q'|$, and
         (ii) mapping $f : P' \to Q'$, fulfilling the following conditions:
             (A) $|P'| = \ell$,
             (B) $d = \mathrm{RMSD}(P', f(P'))$ is minimized.

1. For each translation $t \in \{ I/\ell \mid -\ell c_{\max} \le I \le \ell c_{\max} \}^3$,
       For each $3 \times 3$ matrix $M$, where $\forall e \in M$,
       $e \in \{ I/\ell^2 \mid -4\ell^3 c_{\max}^2 \le I \le 4\ell^3 c_{\max}^2 \}$,
           Compute rotation matrix $R$ from $M$.
           $Q_t \leftarrow RQ - t$.
           Apply an algorithm for the case where the superposition
           is known to $P$ and $Q_t$ (as discussed in the 'Complexity of
           the LCP and MAD when the optimal superposition is
           known' section), and denote the solution $\mathrm{MAD}(P, Q_t)$.
2. Output the $\mathrm{MAD}(P, Q_t)$ of the smallest RMSD as the solution.

According to Lemma 8, the optimal translations $t_P$ and
$t_Q$ must be within $\{ I/\ell \mid -\ell c_{\max} \le I \le \ell c_{\max} \}^3$. To find the
optimal rotation matrix, it suffices that we try all the possible
values for each entry in Equation 8. Since there are
$O(\ell^{27} c_{\max}^{18})$ matrices, the number of total transformations to
examine is bounded by $O(\ell^{33} c_{\max}^{24})$. It takes time $O(mn\ell)$
to identify the MAD solution for each transformation. An
LCP solution can be obtained by iterating $\ell$ from 1 to
$\min\{m, n\}$ for the MAD problem.

The running time is the product of three
parts: the number of possible translations, the number of
possible rotation matrices, and the running time for a given
rotation matrix and translation combination (that is,
the running time when the transformation is known).
These numbers are bounded by polynomials in $c_{\max}$, which is bounded by
$m$ when we consider the properties of protein structures.
Likewise, $c_{\max}$ is polynomial with respect to the input size
if coded in unary.

Theorem 10. The MAD problem can be solved in
$O(\ell^{34} m^{25} n)$ time for protein structures, and in $O(\ell^{34} m^{9} n)$
time for globular protein structures. The LCP problem can
be solved in $O(\ell^{35} m^{25} n)$ time for protein structures, and in
$O(\ell^{35} m^{9} n)$ time for globular protein structures. Both the
MAD and LCP problems are pseudo-polynomially solvable
for general point sets.

Conclusions

We studied the LCP problem under the RMSD in this
paper. As it turns out, the difficulty of the problem does
not lie in its combinatoric aspect or its structural superposition
aspect alone. That is, if the problem is hard, then it
must be a consequence of both aspects. Our results show
that if one is allowed to compromise on one of the aspects,
then the problem is solvable exactly in polynomial time.
Regrettably, we do not see how the optimal solution can
be obtained in both cases.

On the other hand, we showed an encouraging result:
There is a polynomial time algorithm which solves the
problem optimally, if one restricts the input coordinates in
the problem to be integral, and places a limit on the distance
between consecutive points. These requirements do
not pose any restriction to typical uses in the analysis of
protein structures, since protein structures are specified
only to a fixed precision in practice, and there is an upper
bound to the distance between protein residues.

One problem is that our proposed polynomial time
algorithm remains high in time complexity. We hope
that the present work will provide the foundation for
future efforts to obtain algorithms with lower runtime
complexities.

Appendix

In this Appendix, we include the proofs of the results in
the paper which have been omitted to enhance readability.



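Lemma 4.
$$\sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left| d^P_{\pi(i),\pi(i+\lfloor \ell/2 \rfloor)} - d^Q_{\pi(i),\pi(i+\lfloor \ell/2 \rfloor)} \right|^2} \le \mathrm{RMSD}(P, Q).$$

Proof. Without loss of generality, we just show that

$$\sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left| d^P_{i,i+\lfloor \ell/2 \rfloor} - d^Q_{i,i+\lfloor \ell/2 \rfloor} \right|^2} \le \mathrm{RMSD}(P, Q).$$

Let

$r_i = \|T(q_i) - p_i\|^2 + \|T(q_{i+\lfloor \ell/2 \rfloor}) - p_{i+\lfloor \ell/2 \rfloor}\|^2$,

$u_i = (p_i, p_{i+\lfloor \ell/2 \rfloor})$, and

$v_i = (q_i, q_{i+\lfloor \ell/2 \rfloor})$, where $1 \le i \le \lfloor \ell/2 \rfloor$.

First, we prove that $r_i \ge (\|u_i\| - \|v_i\|)^2/2$, for $1 \le i \le \lfloor \ell/2 \rfloor$.
We first superimpose $u_i$ and $v_i$ to optimize the squared
sum; that is, to find a transformation $T$ such that
$\|T(q_i) - p_i\|^2 + \|T(q_{i+\lfloor \ell/2 \rfloor}) - p_{i+\lfloor \ell/2 \rfloor}\|^2$ is minimized. The
centroids have to coincide to minimize the squared distance.
Assume the centroids are at the origin and that the angle
between $(o, p_i)$ and $(o, q_i)$ is $\alpha$, where $o$ is the origin; then
by trigonometry

$$\begin{aligned}
&\|T(q_i) - p_i\|^2 + \|T(q_{i+\lfloor \ell/2 \rfloor}) - p_{i+\lfloor \ell/2 \rfloor}\|^2 \\
&\quad= 2\left[ \left(\tfrac{1}{2}\|u_i\|\right)^2 + \left(\tfrac{1}{2}\|v_i\|\right)^2 - 2 \cdot \tfrac{1}{2}\|u_i\| \cdot \tfrac{1}{2}\|v_i\| \cos\alpha \right] \\
&\quad\ge \frac{(\|u_i\| - \|v_i\|)^2}{2}.
\end{aligned}$$

$r_i$ is the squared distance under transformation $T$, which
may not be optimal for superimposing $u_i$ and $v_i$. Therefore,

$$r_i \ge \frac{(\|u_i\| - \|v_i\|)^2}{2}.$$

Putting things together, we have

$$\mathrm{RMSD}(P, Q) \ge \sqrt{\frac{(\|u_1\| - \|v_1\|)^2 + \ldots + (\|u_{\lfloor \ell/2 \rfloor}\| - \|v_{\lfloor \ell/2 \rfloor}\|)^2}{2\ell}}
= \sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left| d^P_{i,i+\lfloor \ell/2 \rfloor} - d^Q_{i,i+\lfloor \ell/2 \rfloor} \right|^2}.$$

□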



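Lemma 5. If $\mathrm{RMSD}(P, Q) \ne 0$, then
$$\mathrm{RMSD}(P, Q) \ge \frac{\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}}{\sqrt{2\ell}}.$$

Proof. If $\mathrm{RMSD}(P, Q)$ is non-zero, then there is at least
a pair of indices $i$ and $j$ such that $|d^P_{i,j} - d^Q_{i,j}| > 0$.
According to Lemma 4,

$$\mathrm{RMSD}(P, Q) \ge \frac{1}{\sqrt{2\ell}} \sqrt{\left| d^P_{i,j} - d^Q_{i,j} \right|^2} \ge \frac{\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}}{\sqrt{2\ell}}.$$

The last inequality holds because, with integral coordinates in
$[-c_{\max}, c_{\max}]$, the squared distances are integers no larger than
$12c_{\max}^2$, and the square roots of two distinct such integers differ
by at least $\sqrt{12c_{\max}^2} - \sqrt{12c_{\max}^2 - 1}$.

□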



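Lemma 6. $\mathrm{RMSD}(P, Q) \le 4\sqrt{3}\, c_{\max}$.

Proof. Denote the furthest point to the origin in $P$ as
$p_{\max}$, and the furthest point to the origin in $Q$ as $q_{\max}$.
Then,

$$\begin{aligned}
\ell \times \mathrm{RMSD}^2(P, Q)
&= \sum_{i=1}^{\ell} \|T(q_i) - p_i\|^2 \\
&\le \sum_{i=1}^{\ell} \left(\|q_i - p_i\|\right)^2 \\
&\le \sum_{i=1}^{\ell} \max\{ \|q_{\max} - p_{\max}\|^2, \|q_{\max} + p_{\max}\|^2 \} \\
&\le \ell \max\{ \|q_{\max} - p_{\max}\|^2, \|q_{\max} + p_{\max}\|^2 \}.
\end{aligned}$$

□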